Extension of the Blahut-Arimoto algorithm for 
maximizing directed information 



We extend the Blahut-Arimoto algorithm to maximize Massey's directed information. The algorithm can be used for estimating the capacity of channels with delayed feedback, where the feedback is a deterministic function of the output. To do so, we apply the ideas of the regular Blahut-Arimoto algorithm, i.e., the alternating maximization procedure, to our new problem. We provide both upper and lower bound sequences that converge to the optimum value. Our main insight in this paper is that in order to find the maximum of the directed information over causally conditioned probability mass functions (PMFs), one can use a backward index time maximization combined with the alternating maximization procedure. We give a detailed description of the algorithm, its complexity, the memory needed, and several numerical examples.

Index Terms 

Alternating maximization procedure, backwards index time maximization, Blahut-Arimoto algorithm, causal conditioning, channels with feedback, directed information, finite state channels, Ising channel, trapdoor channel.

I. Introduction 

In his seminal work, Shannon [1] showed that the capacity of a memoryless channel is given by the optimization problem

$$C = \max_{p(x)} I(X;Y). \tag{1}$$

Since the set of all $p(x)$ is not of finite cardinality, an optimization method is required to find the capacity $C$. In order to obtain an efficient way to calculate the global maximum in (1), the well-known Blahut-Arimoto algorithm (referred to as BAA) was introduced by Blahut [2] and Arimoto [3] in 1972. The main idea is that we can calculate the optimum value using the equality

$$\max_{p(x)} I(X;Y) = \max_{p(x)} \max_{p(x|y)} \sum_{x,y} p(x)\, p(y|x) \log \frac{p(x|y)}{p(x)},$$

i.e., we can maximize over $p(x)$ and $p(x|y)$, instead of just $p(x)$ alone. The maximization is then achieved using the alternating maximization procedure. The convergence of the alternating maximization procedure to the global maximum was proven by Csiszár and Tusnády [4], and later by Yeung [5].
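As an illustration of the regular BAA described above, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `blahut_arimoto`, the row-stochastic list-of-lists channel format, and the stopping tolerance are assumptions made for this example. The two maximizations (over $p(x|y)$ and over $p(x)$) are folded into the single multiplicative update $r(x) \leftarrow r(x)\,2^{D(p(\cdot|x)\,\|\,p_Y)}$, which is algebraically the same alternating step.

```python
import math

def blahut_arimoto(p_y_given_x, tol=1e-9, max_iter=10_000):
    """Regular BAA sketch: returns (capacity estimate in bits, input PMF r).

    p_y_given_x[x][y] is the channel matrix; every row must sum to 1.
    """
    nx, ny = len(p_y_given_x), len(p_y_given_x[0])
    r = [1.0 / nx] * nx                      # start from the uniform input PMF
    lower = 0.0
    for _ in range(max_iter):
        # Output PMF induced by r: p(y) = sum_x r(x) p(y|x).
        py = [sum(r[x] * p_y_given_x[x][y] for x in range(nx)) for y in range(ny)]
        # d[x] = D( p(.|x) || p(.) ) in bits; terms with p(y|x) = 0 contribute 0.
        d = [sum(p * math.log2(p / py[y]) for y, p in enumerate(row) if p > 0)
             for row in p_y_given_x]
        lower = sum(r[x] * d[x] for x in range(nx))   # I(X;Y) under the current r
        upper = max(d)                                # standard BAA upper bound
        if upper - lower < tol:
            break
        # Combined update of p(x|y) and p(x): r(x) <- r(x) * 2^{d[x]} / normalizer.
        w = [r[x] * 2.0 ** d[x] for x in range(nx)]
        z = sum(w)
        r = [v / z for v in w]
    return lower, r

if __name__ == "__main__":
    bsc = [[0.7, 0.3], [0.3, 0.7]]           # BSC with crossover probability 0.3
    print(blahut_arimoto(bsc))
```

For the BSC with crossover probability 0.3 the loop stops at the uniform input, and the returned value equals $1 - H(0.3)$.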

In this paper, we find an efficient way to estimate the capacity of channels with feedback. It was shown by Massey [6], Kramer [7], Tatikonda and Mitter [8], Permuter, Weissman, and Goldsmith [9], and Kim [10] that the expression

$$C_n = \frac{1}{n} \max_{p(x^n \| y^{n-1})} I(X^n \to Y^n)$$

has an important role in characterizing the feedback capacity, where

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1})$$

is the directed information, and $p(y^n \| x^n)$ is a causally conditioned PMF (definitions in Section II) given by

$$p(y^n \| x^n) = \prod_{i=1}^{n} p(y_i \mid x^i, y^{i-1}). \tag{2}$$

Since in the maximization we deal with causally conditioned PMFs, trying to follow the regular BAA results in difficulties. This is because a causally conditioned PMF is a product of conditional PMFs, as seen in (2). While in the regular BAA we maximize over $p(x^n)$, and thus the constraints are simply $\sum_{x^n} p(x^n) = 1$ and $p(x^n) \ge 0$, in our extended problem we have no efficient way of writing all the constraints necessary for a causally conditioned PMF. In fact, we need $n$ simple constraints, one for each factor $p(x_i|x^{i-1},y^{i-1})$ of the product. Another difficulty is that although the equality

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1})$$

holds, we cannot translate the given problem into

$$\sum_{i=1}^{n} \max_{p(x_i|x^{i-1},y^{i-1})} I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1}),$$

since $p(x_i|x^{i-1},y^{i-1})$ influences all terms $\{I(X_j; Y_j^n \mid X^{j-1}, Y^{j-1})\}_{j=1}^{n}$. A solution could be to maximize backwards from $i = n$ to $i = 1$ over $p(x_i|x^{i-1},y^{i-1})$, and it can be shown that in each maximization, the non-causal probability $p(x_i|x^{i-1},y^{n})$ is determined only by the previous $p(x_j|x^{j-1},y^{j-1})$ for $j > i$. In our solution, we maximize the entire expression $I(X^n \to Y^n)$ as a function of $\{p(x_1), p(x_2|x_1,y_1), \dots, p(x_n|x^{n-1},y^{n-1}), p(x^n|y^n)\}$. Each time we maximize over a specific $p(x_i|x^{i-1},y^{i-1})$, starting from $i = n$ and moving backwards to $i = 1$, where all but $p(x_i|x^{i-1},y^{i-1})$ are fixed.

Before we present the extension of the BAA to the directed information, let us present some of the other extensions of this algorithm. In 2004, Matz and Duhamel [11] proposed two Blahut-Arimoto-type algorithms that often converge significantly faster than the standard Blahut-Arimoto algorithm, which relied on following the natural gradient rather than maximizing per variable. During that year, Rezaeian and Grant [12] generalized the regular BAA for multiple access channels, and Dupuis, Yu, and Willems extended the BAA for channels with side information [13]. They used the fact that the input is a deterministic function of the auxiliary variable and the side information, and then extended the input alphabet. Another solution to the side information problem was given by El Gamal and Heegard [14], where they did not expand the alphabet, but included an additional step to optimize over $p(x|u,s)$. Also, the BAA was used by Egorov, Markavian, and Pickavance [15] to decode Reed-Solomon codes. In 2005, Dauwels [16] showed how the BAA can be used to calculate the capacity of continuous channels. Dauwels's main idea is the use of sequential Monte-Carlo integration methods known as "particle filters". In 2008, Vontobel, Kavcic, Arnold, and Loeliger [17] extended the regular BAA to estimate the capacity of finite state channels where the input is Markovian. Sumszyk and Steinberg [18] gave a single-letter characterization of the capacity of an information embedding channel and provided a BA-type algorithm for the case where the channel is independent of the host given the input.

Recently, a few papers about the maximization of the directed information using control theory and dynamic programming were published. In [19], Yang, Kavcic and Tatikonda maximized the directed information to estimate the feedback capacity of finite-state machine channels where the state is a deterministic function of the previous state and input. Chen and Berger [20] maximized the directed information for the case where the state of the channel is known to the encoder and decoder in addition to the feedback link. Later, Permuter, Cuff, Van Roy and Weissman [21] maximized the directed information and found the capacity of the trapdoor channel with feedback. In [22], Gorantla and Coleman estimated the maximum of directed information where they considered a dynamical system whose state is an input to a memoryless channel. The state of the dynamical system is affected by its past, an exogenous input, and causal feedback from the channel's output.

The remainder of the paper is organized as follows. In Section II we present the notation we use throughout the paper, and give the outline of the alternating maximization procedure as given by Yeung [5]. In Section III we give a description of the algorithm for solving the optimization problem $\max_{p(x^n\|y^{n-1})} I(X^n \to Y^n)$, calculate the complexity of the algorithm and the memory needed, and compare them with those of the regular BAA. In Section IV we derive the algorithm using the alternating maximization procedure, and show the convergence of our algorithm to the optimum value. Numerical examples for channel capacity with feedback are presented in Section V. In Appendix A we give a wider angle on the feedback channel problem, where the feedback of the channel is a deterministic function $f$ of the output with some delay $d$; namely, we derive the algorithm for the optimization problem $\max_{p(x^n\|z^{n-d})} I(X^n \to Y^n)$, where $z_i = f(y_i)$ and $d \ge 1$. In Appendix B we prove an upper bound for $\max_{p(x^n\|y^{n-d})} I(X^n \to Y^n)$, which converges to the directed information from above and helps in determining the stopping iteration of the algorithm.

II. Preliminaries

A. Directed information and causal conditioning 

In this section we present the definitions of directed information and causally conditioned PMF, originally introduced by Massey [6] (who was inspired by Marko's work [23] on bidirectional communication) and by Kramer [7]. These definitions are necessary in order to address channels with memory. We denote by $X_1^n$ the vector $(X_1, X_2, \dots, X_n)$. Usually we use the notation $X^n = X_1^n$ for short. Further, when writing a PMF we simply write $P_X(X = x) = p(x)$. Let us denote by $p(x^n\|y^{n-d})$ the probability mass function (PMF) of $X^n$ causally conditioned on $Y^{n-d}$, given by

$$p(x^n \| y^{n-d}) \triangleq \prod_{i=1}^{n} p(x_i \mid x^{i-1}, y^{i-d}). \tag{3}$$

Here we have to establish that when $d > n$, the vector $X^{n-d} = \emptyset$. Two straightforward properties of the causally conditioned PMF that we use throughout the paper are

$$\sum_{x_n} p(x^n \| y^{n-d}) = p(x^{n-1} \| y^{n-d-1}), \tag{4}$$

and 

Another elementary property is the chain rule for causal conditioning

$$p(x^n \| y^{n-1})\, p(y^n \| x^n) = p(x^n, y^n). \tag{6}$$
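The definition (3) and the properties (4) and (6) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the helper `causal_pmf`, the input rule `r_step` (a hypothetical feedback strategy), and the memoryless BSC used as the channel `p_step` are not taken from the paper.

```python
import itertools

def causal_pmf(step, a, b, delay):
    """prod_i step(a_i | a^{i-1}, b^{i-delay}) as in (3); a and b are tuples."""
    out = 1.0
    for i in range(1, len(a) + 1):            # i is 1-based, as in the text
        out *= step(a[i - 1], a[:i - 1], b[:max(i - delay, 0)])
    return out

def r_step(x_i, x_hist, y_hist):
    # Hypothetical input rule: uniform, unless the previous input was received
    # in error, in which case it is repeated with probability 0.9.
    if not x_hist or x_hist[-1] == y_hist[-1]:
        return 0.5
    return 0.9 if x_i == x_hist[-1] else 0.1

def p_step(y_i, y_hist, x_hist):
    # Assumed channel: memoryless BSC(0.3); x_hist[-1] is the current input x_i.
    return 0.7 if y_i == x_hist[-1] else 0.3

N = 3
SEQS = list(itertools.product((0, 1), repeat=N))

# Chain rule (6): r(x^n || y^{n-1}) p(y^n || x^n) is a joint PMF, so it sums to 1.
total = sum(causal_pmf(r_step, x, y, 1) * causal_pmf(p_step, y, x, 0)
            for x in SEQS for y in SEQS)

# Property (4) with d = 1: summing out x_n leaves r(x^{n-1} || y^{n-2}).
max_gap = max(
    abs(sum(causal_pmf(r_step, x[:-1] + (xn,), y, 1) for xn in (0, 1))
        - causal_pmf(r_step, x[:-1], y, 1))
    for x in SEQS for y in SEQS)

if __name__ == "__main__":
    print(total, max_gap)
```

The input uses delay 1 (it sees $y^{i-1}$), while the channel is evaluated with delay 0 (it sees $x^i$), which is how the same helper covers both $r(x^n\|y^{n-1})$ and $p(y^n\|x^n)$.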
The definitions above lead to the causally conditioned entropy, which is given by

$$H(Y^n \| X^n) \triangleq -E\left[\log p(Y^n \| X^n)\right].$$

Moreover, the directed information from $X^n$ to $Y^n$ is given by

$$I(X^n \to Y^n) \triangleq H(Y^n) - H(Y^n \| X^n). \tag{7}$$

It is possible to show that we can write the directed information as

$$I(X^n \to Y^n) = \sum_{x^n, y^n} p(y^n\|x^n)\, r(x^n\|y^{n-1}) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-1})}.$$

We refer to this form when using the alternating maximization procedure, since $\{r = r(x^n\|y^{n-1}),\ q = q(x^n|y^n)\}$ are the variables we optimize over, where $p(y^n\|x^n)$ is fixed. For convenience, we use from now on the notation

$$I(X^n \to Y^n) = \mathcal{I}(r, q) \tag{8}$$

when required. With these definitions, we follow the alternating maximization procedure given by Yeung [5] in order to maximize the directed information.

B. Alternating maximization procedure 

Here, we present the alternating maximization procedure on which our algorithm is based. Let $f(u_1,u_2)$ be a real function, and let us consider the optimization problem given by

$$\sup_{u_1 \in A_1,\ u_2 \in A_2} f(u_1,u_2) = f^*.$$

We denote by $c_2(u_1) \in A_2$ the point that achieves $\sup_{u_2 \in A_2} f(u_1,u_2)$, and by $c_1(u_2) \in A_1$ the one that achieves $\sup_{u_1 \in A_1} f(u_1,u_2)$. The algorithm is defined by iterations, where in each iteration we maximize over one of the variables. Let $(u_1^0, u_2^0)$ be an arbitrary point in $A_1 \times A_2$. For $k \ge 1$ let

$$(u_1^k, u_2^k) = \left(c_1(u_2^{k-1}),\ c_2(c_1(u_2^{k-1}))\right),$$

and let $f^k = f(u_1^k, u_2^k)$ be the value in the present iteration. The following lemma describes the conditions the problem needs to meet in order for $f^k$ to converge to $f^*$ as $k$ goes to infinity.

Lemma 1 (Lemmas 9.4, 9.5 in [5], "Convergence of the alternating maximization procedure"). Let $f(u_1,u_2)$ be a real, concave function, bounded from above, that is continuous and has continuous partial derivatives, and let the sets $A_1, A_2$, which we maximize over, be convex. Further, assume that $c_2(u_1) \in A_2$ and $c_1(u_2) \in A_1$ for all $u_1 \in A_1$, $u_2 \in A_2$. Under these conditions, $\lim_{k \to \infty} f^k = f^*$.

In Section III we give a detailed description of the algorithm that computes $\max_{p(x^n\|y^{n-1})} I(X^n \to Y^n)$ based on the alternating maximization procedure. In Section IV we show that the conditions in Lemma 1 hold, and therefore the algorithm we suggest, which is based on the alternating maximization procedure, converges to the global optimum.
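The iteration $(u_1^k, u_2^k) = (c_1(u_2^{k-1}), c_2(u_1^k))$ can be sketched on a toy problem. Everything in the example below is an assumption chosen for illustration: the concave quadratic $f(u_1,u_2) = -(u_1-1)^2 - (u_2-2)^2 - u_1 u_2$ over $A_1 = A_2 = \mathbb{R}$, and the closed-form partial maximizers `c1`, `c2` obtained by setting each partial derivative to zero. Its global maximum is $f^* = -1$ at $(0, 2)$.

```python
def alternating_maximization(f, c1, c2, u2_start, iters=50):
    """Generic alternating maximization: returns (u1, u2, list of f^k)."""
    u2 = u2_start
    u1 = None
    values = []
    for _ in range(iters):
        u1 = c1(u2)          # maximize over u1 with u2 fixed
        u2 = c2(u1)          # maximize over u2 with the new u1 fixed
        values.append(f(u1, u2))
    return u1, u2, values

# Toy concave objective (an assumption for illustration, not from the paper).
def f(u1, u2):
    return -(u1 - 1.0) ** 2 - (u2 - 2.0) ** 2 - u1 * u2

def c1(u2):
    return 1.0 - u2 / 2.0    # solves d f / d u1 = 0

def c2(u1):
    return 2.0 - u1 / 2.0    # solves d f / d u2 = 0

if __name__ == "__main__":
    u1, u2, values = alternating_maximization(f, c1, c2, u2_start=0.0)
    print(u1, u2, values[-1])
```

The sequence $f^k$ is non-decreasing because each half-step is an exact maximization, which is the same property used for the lower bound of the algorithm in Section III.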

III. Description of the algorithm 

In this section, we describe an algorithm for maximizing the directed information. In addition, we compute the 
complexity of the algorithm per iteration, and compare it to the complexity of the regular BAA. The memory 
calculation is also given. 

A. The algorithm for channel with feedback 

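In Algorithm 1 we present the steps required to maximize the directed information where the channel $p(y^n\|x^n)$ is fixed and the delay is $d = 1$. Note that Algorithm 1 has a structure similar to that of the regular BAA, except that step (b) is an additional backward loop. Its purpose is to maximize over the causally conditioned input probability, which is not necessary in the regular BAA.

Algorithm 1: Iterative algorithm for calculating $\max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$, where $p(y^n\|x^n)$ is fixed.

(a) Start from a random point $q(x^n|y^n)$. Usually we start from a uniform distribution, i.e., $q(x^n|y^n) = 2^{-n}$ for every $(x^n, y^n)$.

(b) Starting from $i = n$, calculate $r(x_i|x^{i-1},y^{i-1})$ using the formula

$$r(x_i|x^{i-1},y^{i-1}) = \frac{r'(x^i,y^{i-1})}{\sum_{x_i} r'(x^i,y^{i-1})}, \tag{9}$$

where

$$r'(x^i,y^{i-1}) = \prod_{x_{i+1}^n,\, y_i^n} \left[\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}\right]^{\prod_{j=i}^{n} p(y_j|x^j,y^{j-1}) \prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}, \tag{10}$$

and do so backwards until $i = 1$.

(c) Once you have $r(x_i|x^{i-1},y^{i-1})$ for all $i \in \{1,\dots,n\}$, compute $r(x^n\|y^{n-1}) = \prod_{i=1}^{n} r(x_i|x^{i-1},y^{i-1})$.

(d) Compute $q(x^n|y^n)$ using the formula

$$q(x^n|y^n) = \frac{r(x^n\|y^{n-1})\, p(y^n\|x^n)}{\sum_{x^n} r(x^n\|y^{n-1})\, p(y^n\|x^n)}. \tag{11}$$

(e) Calculate $I_U - I_L$, where

$$I_L = \frac{1}{n} \sum_{x^n, y^n} p(y^n\|x^n)\, r(x^n\|y^{n-1}) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-1})},$$

$$I_U = \frac{1}{n} \max_{x_1} \sum_{y_1} \max_{x_2} \cdots \max_{x_n} \sum_{y_n} p(y^n\|x^n) \log \frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-1})}.$$

(f) Return to (b) if $(I_U - I_L) > \epsilon$.

(g) $C_n = I_L$.

Now, let us present a special case and a few extensions for Alg. 1.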



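(1) Regular BAA, i.e., $n = 1$. For $n = 1$, the algorithm suggested here agrees with the original BAA, where instead of steps (b), (c) we have

$$r(x) = \frac{r'(x)}{\sum_{x} r'(x)}, \qquad r'(x) = \prod_{y} q(x|y)^{p(y|x)}, \tag{12}$$

and step (d) is replaced by

$$q(x|y) = \frac{r(x)\, p(y|x)}{\sum_{x} r(x)\, p(y|x)}. \tag{13}$$

The bounds $I_L$, $I_U$ agree with the regular BAA as well, and are of the form

$$I_L = \sum_{x,y} p(y|x)\, r(x) \log \frac{q(x|y)}{r(x)}, \qquad I_U = \max_{x} \sum_{y} p(y|x) \log \frac{p(y|x)}{\sum_{x'} p(y|x')\, r(x')}.$$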

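(2) Feedback with general delay $d$. We can generalize the algorithm in order to compute $\max_{r(x^n\|y^{n-d})} I(X^n \to Y^n)$, where the feedback is the output with delay $d$. In that case, in step (b) we have $r(x_i|x^{i-1},y^{i-d}) = r'(x^i,y^{i-d}) / \sum_{x_i} r'(x^i,y^{i-d})$, where

$$r'(x^i,y^{i-d}) = \prod_{x_{i+1}^n,\, y_{i-d+1}^n} \left[\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-d})}\right]^{\prod_{j=i-d+1}^{n} p(y_j|x^j,y^{j-1}) \prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-d})}, \tag{14}$$

and step (d) will be replaced by

$$q(x^n|y^n) = \frac{r(x^n\|y^{n-d})\, p(y^n\|x^n)}{\sum_{x^n} r(x^n\|y^{n-d})\, p(y^n\|x^n)}. \tag{15}$$

The bounds $I_L$, $I_U$ are of the form

$$I_L = \frac{1}{n} \sum_{x^n, y^n} p(y^n\|x^n)\, r(x^n\|y^{n-d}) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-d})},$$

$$I_U = \frac{1}{n} \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \sum_{y_2} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log \frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}.$$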

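(3) Feedback as a function of the output with general delay. In Appendix A we generalize the algorithm in order to compute $\max_{r(x^n\|z^{n-d})} I(X^n \to Y^n)$, where the feedback $z^{n-d}$ is a deterministic function of the delayed output. The expression characterizes the capacity of channels with time-invariant feedback [9]. In that case, in step (b) we have $r(x_i|x^{i-1},z^{i-d}) = r'(x^i,z^{i-d}) / \sum_{x_i} r'(x^i,z^{i-d})$, where

$$r'(x^i,z^{i-d}) = \prod_{x_{i+1}^n,\, y_{i-d+1}^n,\, y^{i-d} \in A_{i,d,z}} \left[\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1},z^{j-d})}\right]^{\frac{p(y^n\|x^n) \prod_{j=i+1}^{n} r(x_j|x^{j-1},z^{j-d})}{\sum_{y^{i-d} \in A_{i,d,z}} \prod_{j=1}^{i-d} p(y_j|x^j,y^{j-1})}}, \tag{16}$$

where we define the set $A_{i,d,z} = \{y^{i-d} : z^{i-d} = f(y^{i-d})\}$ as the set of output sequences that $f$ transforms to $z^{i-d}$, and step (d) will be replaced by

$$q(x^n|y^n) = \frac{r(x^n\|z^{n-d})\, p(y^n\|x^n)}{\sum_{x^n} r(x^n\|z^{n-d})\, p(y^n\|x^n)}. \tag{17}$$

The bounds $I_L$, $I_U$ are of the form

$$I_L = \frac{1}{n} \sum_{x^n, y^n} p(y^n\|x^n)\, r(x^n\|z^{n-d}) \log \frac{q(x^n|y^n)}{r(x^n\|z^{n-d})},$$

$$I_U = \frac{1}{n} \max_{x^d} \sum_{z_1} \max_{x_{d+1}} \cdots \max_{x_n} \sum_{y^{n-d} \in A_{n,d,z}} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log \frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|z^{n-d})}.$$

Note that for $d = n$, the vector $z^{n-d} = \emptyset$, hence $r(x_i|x^{i-1},z^{i-d}) = r(x_i|x^{i-1})$. Also note that when $f(y) = \text{const}$, $r(x^n\|z^{n-d}) = r(x^n)$, $A_{i,d,z} = \mathcal{Y}^{i-d}$ and $\sum_{y^{i-d}} \prod_{j=1}^{i-d} p(y_j|x^j,y^{j-1}) = 1$. In each of the cases above ($d = n$ or $f(y) = \text{const}$), in step (d) we have

$$q(x^n|y^n) = \frac{r(x^n)\, p(y^n\|x^n)}{\sum_{x^n} r(x^n)\, p(y^n\|x^n)},$$

and we obtain a different version of the regular BAA for channel capacity, where the maximization is done over all $r(x_i|x^{i-1})$ instead of over $r(x^n)$ at once. Furthermore, if $f(y) = y$ then case (3) agrees with all the equations of case (2).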
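Steps (a) to (e) of Algorithm 1 for delay $d = 1$ can be sketched in plain Python by brute-force enumeration of all sequences. This is a minimal illustration under stated assumptions, not the authors' code: the function name `extended_baa`, the callback signature `p_step(y_i, x^i, y^{i-1})`, and the optional starting point `q0` are hypothetical, the channel is assumed strictly positive (so every logarithm is finite), and only the lower bound $I_L$ is tracked, in bits. The backward loop evaluates the exponent of $r'$ in the log domain and normalizes over $x_i$.

```python
import itertools
import math

def extended_baa(p_step, n, xs=(0, 1), ys=(0, 1), iters=50, q0=None):
    """Sketch of Algorithm 1 (delay d = 1); returns the I_L value of each iteration."""
    X = list(itertools.product(xs, repeat=n))
    Y = list(itertools.product(ys, repeat=n))
    # Causally conditioned channel p(y^n || x^n) = prod_i p(y_i | x^i, y^{i-1}).
    pc = {(x, y): math.prod(p_step(y[i], x[:i + 1], y[:i]) for i in range(n))
          for x in X for y in Y}
    # Step (a): start from q0 if given, otherwise from the uniform q(x^n | y^n).
    q = dict(q0) if q0 else {(x, y): 1.0 / len(X) for x in X for y in Y}
    history = []
    for _ in range(iters):
        r = {}  # r[(x^i, y^{i-1})] holds r(x_i | x^{i-1}, y^{i-1})
        # Step (b): backward loop, i = n, ..., 1 (0-based: n-1, ..., 0).
        for i in reversed(range(n)):
            for xp in itertools.product(xs, repeat=i):
                for yp in itertools.product(ys, repeat=i):
                    expo = {}
                    for xi in xs:
                        a = 0.0
                        for xt in itertools.product(xs, repeat=n - i - 1):
                            for yt in itertools.product(ys, repeat=n - i):
                                x, y = xp + (xi,) + xt, yp + yt
                                # Product of the already updated later factors r_j, j > i.
                                rt = math.prod(r[(x[:j + 1], y[:j])] for j in range(i + 1, n))
                                # Channel factors p_j for j >= i.
                                pt = math.prod(p_step(y[j], x[:j + 1], y[:j]) for j in range(i, n))
                                a += pt * rt * math.log(q[(x, y)] / rt)
                        expo[xi] = a          # natural log of r'(x^i, y^{i-1})
                    m = max(expo.values())
                    z = sum(math.exp(v - m) for v in expo.values())
                    for xi in xs:
                        r[(xp + (xi,), yp)] = math.exp(expo[xi] - m) / z
        # Step (c): r(x^n || y^{n-1}) as the product of its factors.
        rc = {(x, y): math.prod(r[(x[:j + 1], y[:j])] for j in range(n))
              for x in X for y in Y}
        # Step (d): q(x^n | y^n) proportional to r(x^n || y^{n-1}) p(y^n || x^n).
        for y in Y:
            den = sum(rc[(x, y)] * pc[(x, y)] for x in X)
            for x in X:
                q[(x, y)] = rc[(x, y)] * pc[(x, y)] / den
        # Step (e), lower bound only: I_L = (1/n) sum r p log2(q / r).
        history.append(sum(rc[k] * pc[k] * math.log2(q[k] / rc[k]) for k in pc) / n)
    return history

if __name__ == "__main__":
    bsc = lambda y, x_hist, y_hist: 0.7 if y == x_hist[-1] else 0.3
    print(extended_baa(bsc, n=2, iters=5)[-1])
```

The cost of this enumeration grows as $(|\mathcal{X}||\mathcal{Y}|)^n$, so the sketch is only practical for very short blocks. Channels with zero transition probabilities (such as the trapdoor channel) would need explicit handling of zero-probability terms, which is omitted here.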

B. Complexity and Memory needed 

Here, we give an expression for the computational complexity of one iteration of the algorithm, and then compare it to the regular BAA. This will be done in two parts, one for each step in the iteration.

(1) Complexity of computing $q(x^n|y^n)$ as given in (11). For each $y^n$, we need $|\mathcal{X}|^n$ multiplications for a specific $x^n$ and use the denominator computed for every other $x^n$, thus obtaining $O(|\mathcal{X}|^n)$ operations. Doing so for all $y^n$ achieves $O(|\mathcal{X}|^n |\mathcal{Y}|^n) = O((|\mathcal{X}||\mathcal{Y}|)^n)$.

(2) Complexity of computing $r(x^n\|y^{n-1})$. First, we compute the complexity of each $r(x_i|x^{i-1},y^{i-1})$ as given in (10), assuming that an exponent is a constant number of computations, i.e., $O(1)$. Simple computations will conclude that the entire numerator takes about $O((n-i)(|\mathcal{X}||\mathcal{Y}|)^{n-i})$ computations. The denominator is a summation over $|\mathcal{X}|$ variables, and as with $q(x^n|y^n)$, we can use the denominator for every other $x_i$. Hence, we obtain $O((n-i)(|\mathcal{X}||\mathcal{Y}|)^n)$ computations for every $i \in \{1,\dots,n\}$. Summing over $i$ will achieve $O((n+n^2)(|\mathcal{X}||\mathcal{Y}|)^n) \approx O(n^2(|\mathcal{X}||\mathcal{Y}|)^n)$ computations. Multiplying all $r(x_i|x^{i-1},y^{i-1})$ is a constant number of computations for every $(x_i,y_i)$. Finally, in order to compute $r(x^n\|y^{n-1})$ we need $O((n^2+n)(|\mathcal{X}||\mathcal{Y}|)^n)$ computations.

To conclude, each iteration requires about $O(n^2(|\mathcal{X}||\mathcal{Y}|)^n)$ computations.

Comparing to regular BAA: Since BAA computes the capacity of memoryless channels, we only need to compute $r(x)$ and $q(x|y)$. In much the same way, we can derive its complexity and achieve $O(|\mathcal{X}||\mathcal{Y}|)$ computations. However, if we want to compare it to BAA for channels with memory, we replace $\mathcal{X} \leftrightarrow \mathcal{X}^n$, $\mathcal{Y} \leftrightarrow \mathcal{Y}^n$. But $|\mathcal{X}^n| = |\mathcal{X}|^n$, and so we obtain $O((|\mathcal{X}||\mathcal{Y}|)^n)$ computations. The memory needed for the algorithm depends very much on the manner in which one implements the algorithm. However, the obligatory memory needed is for $q$, $p$, and $r$ and its products; thus we need at least $n(|\mathcal{X}||\mathcal{Y}|)^n$ cells of type double. Computational complexity and memory needed are presented in Table I.
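To get a feeling for these orders of growth, the expressions above can be evaluated for concrete alphabet sizes. The helper below is a hypothetical back-of-the-envelope calculator: `baa_costs` and its dictionary keys are names made up for this example, constants hidden by the $O(\cdot)$ notation are ignored, and 8 bytes per cell of type double is assumed.

```python
def baa_costs(n, nx=2, ny=2, bytes_per_cell=8):
    """Order-of-magnitude counts per iteration for block length n."""
    base = (nx * ny) ** n                      # (|X||Y|)^n
    return {
        "regular_ops": base,                   # regular BAA on X^n, Y^n
        "extended_ops": n * n * base,          # n^2 (|X||Y|)^n for Alg. 1
        "regular_cells": base,
        "extended_cells": n * base,            # n (|X||Y|)^n cells
        "extended_bytes": n * base * bytes_per_cell,
    }

if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, baa_costs(n))
```

For binary alphabets and $n = 12$ this gives about $2.4 \cdot 10^9$ operations per iteration and $1.5$ GiB of memory, which indicates why the numerical examples are limited to short block lengths.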

TABLE I: Memory and operations needed for regular and extended BAA for channel coding with feedback.

Problem | Operations | Memory
$\max_{p(x^n)} I(X^n;Y^n)$, regular BAA for channel capacity | $O((|\mathcal{X}||\mathcal{Y}|)^n)$ | $(|\mathcal{X}||\mathcal{Y}|)^n$
$\max_{p(x^n\|y^{n-1})} I(X^n \to Y^n)$, Alg. 1 | $O(n^2(|\mathcal{X}||\mathcal{Y}|)^n)$ | $n(|\mathcal{X}||\mathcal{Y}|)^n$

IV. Derivation of Algorithm 1



In this section, we derive Algorithm 1 using the alternating maximization procedure, and conclude its convergence to the global optimum using Lemma 1. Note that throughout the paper the channel $p(y^n\|x^n)$ is fixed in all maximization calculations. We present several lemmas that will assist in proving our main goal: an algorithm for calculating $\max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$. In Lemma 2 we show that the directed information function has the properties required for Lemma 1. In Lemma 3 we show that we are allowed to maximize the directed information over $r(x^n\|y^{n-1})$ and $q(x^n|y^n)$ combined, rather than just over $r(x^n\|y^{n-1})$, thus creating an opportunity to use the alternating maximization procedure for achieving the optimum value. Lemma 4 is a supplementary claim that helps us prove Lemma 3, in which we find an expression for $q(x^n|y^n)$ that maximizes the directed information where $r(x^n\|y^{n-1})$ is fixed. In Lemma 5 we find an explicit expression for $r(x^n\|y^{n-1})$ that maximizes the directed information where $q(x^n|y^n)$ is fixed. Theorem 1 combines all lemmas to show that the alternating maximization procedure as described by $I_L$ in Alg. 1 exists and converges. We end with Theorem 2, which proves the existence of the upper bound $I_U$.

Lemma 2. For a fixed channel $p(y^n\|x^n)$, the directed information given by

$$I(X^n \to Y^n) = \sum_{x^n, y^n} p(y^n\|x^n)\, r(x^n\|y^{n-1}) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-1})} \tag{18}$$

as a function of $\{r = r(x^n\|y^{n-1}),\ q = q(x^n|y^n)\}$ is concave, continuous and has continuous partial derivatives.

Proof: First we need to show that the directed information can be written as above by using the causal conditioning chain rule:

$$I(X^n \to Y^n) = \sum_{x^n,y^n} p(x^n,y^n) \log \frac{p(y^n\|x^n)}{p(y^n)} = \sum_{x^n,y^n} p(x^n,y^n) \log \frac{p(x^n|y^n)}{p(x^n\|y^{n-1})} = \sum_{x^n,y^n} p(y^n\|x^n)\, r(x^n\|y^{n-1}) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-1})}.$$

Then we recall the log-sum inequality [24, Theorem 2.7.1] given by

$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}. \tag{19}$$

We define the sets

$$A_1 = \{r(x^n\|y^{n-1}) : r(x^n\|y^{n-1}) > 0 \text{ is a causally conditioned PMF}\},$$
$$A_2 = \{q(x^n|y^n) : q(x^n|y^n) \text{ is a conditional PMF}\}, \tag{20}$$

as the sets over which we maximize. Now, for $(r_1,q_1), (r_2,q_2)$ in $A = A_1 \times A_2$ and $\lambda \in [0,1]$, by using the log-sum inequality given above we derive that

$$(\lambda r_1 + (1-\lambda) r_2) \log \frac{\lambda r_1 + (1-\lambda) r_2}{\lambda q_1 + (1-\lambda) q_2} \le \lambda r_1 \log \frac{r_1}{q_1} + (1-\lambda) r_2 \log \frac{r_2}{q_2}.$$

Taking the reciprocal of the arguments of the logarithms yields

$$(\lambda r_1 + (1-\lambda) r_2) \log \frac{\lambda q_1 + (1-\lambda) q_2}{\lambda r_1 + (1-\lambda) r_2} \ge \lambda r_1 \log \frac{q_1}{r_1} + (1-\lambda) r_2 \log \frac{q_2}{r_2}.$$

Multiplying by $p(y^n\|x^n)$ and summing over all $x^n, y^n$, and letting $\mathcal{I}(r,q)$ be the directed information as in (8), we obtain

$$\mathcal{I}(\lambda r_1 + (1-\lambda) r_2,\ \lambda q_1 + (1-\lambda) q_2) \ge \lambda\, \mathcal{I}(r_1, q_1) + (1-\lambda)\, \mathcal{I}(r_2, q_2).$$

Further, since the function $\log(x)$ is continuous with continuous partial derivatives, and the directed information is a summation of functions of type $\log(x)$, $\mathcal{I}(r,q)$ has the same properties as well. Moreover, it is simple to verify that the sets $A_1, A_2$ are both convex, and we can conclude that all conditions in Lemma 1 hold for the directed information. ■

Recall that in the alternating maximization procedure we maximize over $\{r(x^n\|y^{n-1}),\ q(x^n|y^n)\}$ instead of over $r(x^n\|y^{n-1})$ alone, and thus need the following lemma.

Lemma 3. For any discrete random variables $X^n, Y^n$, the following holds:

$$\max_{r(x^n\|y^{n-1}),\ q(x^n|y^n)} I(X^n \to Y^n) = \max_{r(x^n\|y^{n-1})} I(X^n \to Y^n). \tag{21}$$

The proof will be given after the following supplementary claim, in which we calculate the specific $q(x^n|y^n)$ that maximizes the directed information where $r(x^n\|y^{n-1})$ is fixed.

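Lemma 4. For fixed $r(x^n\|y^{n-1})$, there exists $c_2(r) = q^*(x^n|y^n)$ that achieves $\max_{q(x^n|y^n)} I(X^n \to Y^n)$, and is given by

$$q^*(x^n|y^n) = \frac{r(x^n\|y^{n-1})\, p(y^n\|x^n)}{\sum_{x^n} r(x^n\|y^{n-1})\, p(y^n\|x^n)}.$$

Proof of Lemma 4: Let $q^* = q^*(x^n|y^n)$. For any $q = q(x^n|y^n)$, and fixed $r = r(x^n\|y^{n-1})$,

$$\mathcal{I}(r,q^*) - \mathcal{I}(r,q) = \sum_{x^n,y^n} r(x^n\|y^{n-1})\, p(y^n\|x^n) \log \frac{q^*(x^n|y^n)}{r(x^n\|y^{n-1})} - \sum_{x^n,y^n} r(x^n\|y^{n-1})\, p(y^n\|x^n) \log \frac{q(x^n|y^n)}{r(x^n\|y^{n-1})}$$

$$= \sum_{x^n,y^n} r(x^n\|y^{n-1})\, p(y^n\|x^n) \log \frac{q^*(x^n|y^n)}{q(x^n|y^n)} = \sum_{y^n} \left(\sum_{x'^n} r(x'^n\|y^{n-1})\, p(y^n\|x'^n)\right) D\big(q^*(\cdot|y^n)\,\|\,q(\cdot|y^n)\big) \overset{(a)}{\ge} 0,$$

where (a) follows from the non-negativity of the divergence. ■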

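Proof of Lemma 3: After finding the PMF $q$ that maximizes $\mathcal{I}(r,q)$ where $r$ is fixed, we can see that $q^*(x^n|y^n)$ is the one that corresponds to the joint distribution $r(x^n\|y^{n-1})\, p(y^n\|x^n)$ in the sense that

$$q^*(x^n|y^n) = \frac{p(x^n,y^n)}{p(y^n)} = \frac{r(x^n\|y^{n-1})\, p(y^n\|x^n)}{\sum_{x^n} r(x^n\|y^{n-1})\, p(y^n\|x^n)},$$

so that $\mathcal{I}(r,q^*)$ is exactly the directed information induced by $r$, and thus the lemma is proven. ■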
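The statements of Lemmas 3 and 4 can be checked numerically for a short block. The sketch below is an illustration only, under assumptions that are not part of the paper: a block length of 2, a hypothetical causally conditioned input `r_causal` (the second input repeats the first with probability 0.8 after a reception error), and a memoryless BSC(0.3) as the channel `p_causal`. It compares $\mathcal{I}(r,q^*)$ with the definition $H(Y^n) - H(Y^n\|X^n)$ in (7), and with $\mathcal{I}(r,q)$ for a uniform $q$.

```python
import itertools
import math

N = 2
SEQS = list(itertools.product((0, 1), repeat=N))

def r_causal(x, y):
    # Hypothetical r(x^2 || y^1): x_1 uniform; x_2 uniform if x_1 was received
    # correctly, otherwise x_2 repeats x_1 with probability 0.8.
    second = 0.5 if y[0] == x[0] else (0.8 if x[1] == x[0] else 0.2)
    return 0.5 * second

def p_causal(y, x):
    # Assumed channel p(y^2 || x^2): memoryless BSC with crossover 0.3.
    return math.prod(0.7 if yi == xi else 0.3 for xi, yi in zip(x, y))

def info_rq(q):
    """I(r, q) = sum r p log2(q / r), in bits."""
    return sum(r_causal(x, y) * p_causal(y, x) * math.log2(q[(x, y)] / r_causal(x, y))
               for x in SEQS for y in SEQS)

joint = {(x, y): r_causal(x, y) * p_causal(y, x) for x in SEQS for y in SEQS}
p_y = {y: sum(joint[(x, y)] for x in SEQS) for y in SEQS}
q_star = {(x, y): joint[(x, y)] / p_y[y] for x in SEQS for y in SEQS}   # Lemma 4
q_unif = {(x, y): 1.0 / len(SEQS) for x in SEQS for y in SEQS}

# Directed information from the definition (7).
h_y = -sum(p * math.log2(p) for p in p_y.values())
h_y_given_x = -sum(j * math.log2(p_causal(y, x)) for (x, y), j in joint.items())
direct = h_y - h_y_given_x

if __name__ == "__main__":
    print(info_rq(q_star), direct, info_rq(q_unif))
```

With $q = q^*$ the two-variable objective coincides with the directed information of the fixed input, and any other conditional PMF, here the uniform one, gives a value that is not larger.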

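In the following lemma, we find an explicit expression for $r$ that achieves $\max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$, where $q$ is fixed.

Lemma 5. For fixed $q(x^n|y^n)$, there exists $c_1(q) = r^*(x^n\|y^{n-1})$ that achieves $\max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$, and is given by the product

$$r^*(x^n\|y^{n-1}) = \prod_{i=1}^{n} r(x_i|x^{i-1},y^{i-1}),$$

where

$$r(x_i|x^{i-1},y^{i-1}) = \frac{r'(x^i,y^{i-1})}{\sum_{x_i} r'(x^i,y^{i-1})}, \tag{22}$$

and

$$r'(x^i,y^{i-1}) = \prod_{x_{i+1}^n,\, y_i^n} \left[\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}\right]^{\prod_{j=i}^{n} p(y_j|x^j,y^{j-1}) \prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}. \tag{23}$$

Proof: In order to find the requested $r$, we find all of its components, namely $\{r(x_i|x^{i-1},y^{i-1})\}_{i=1}^{n}$, by maximizing the directed information over each of them. For convenience, let us use for short $r_i = r(x_i|x^{i-1},y^{i-1})$ and $p_i = p(y_i|x^i,y^{i-1})$. Since in Lemma 2 we showed that $I(X^n \to Y^n)$ is concave in $\{r,q\}$ and the constraints of the optimization problem are affine, we can use the Lagrange multipliers method with the Karush-Kuhn-Tucker conditions [25, Ch. 5.3.3]. We define the Lagrangian as

$$J = \sum_{x^n,y^n} \left(\prod_{j=1}^{n} p_j r_j\right) \log \frac{q(x^n|y^n)}{\prod_{j=1}^{n} r_j} + \sum_{i=1}^{n} \sum_{x^{i-1},y^{i-1}} \nu_{i,(x^{i-1},y^{i-1})} \left(\sum_{x_i} r_i - 1\right).$$

Now, for every $i \in \{1,\dots,n\}$ we find $r_i$ s.t.

$$\frac{\partial J}{\partial r_i} = \sum_{x_{i+1}^n,\, y_i^n} \left(\prod_{j=1}^{n} p_j \prod_{j \ne i} r_j\right) \left[\log \frac{q(x^n|y^n)}{\prod_{j=1}^{n} r_j} - 1\right] + \nu_{i,(x^{i-1},y^{i-1})} = 0.$$

Note that since $\nu_{i,(x^{i-1},y^{i-1})}$ is a function of $(x^{i-1},y^{i-1})$, we can divide the whole equation by $\prod_{j=1}^{i-1} p_j r_j$ and get a new multiplier $\nu^*_{i,(x^{i-1},y^{i-1})}$. Moreover, we can see that three of the expressions in the sum, i.e., $\{\log \prod_{j=1}^{i-1} r_j,\ \log r_i,\ 1\}$, do not depend on $(x_{i+1}^n, y_i^n)$, thus leaving their coefficient in the equation to be

$$\sum_{x_{i+1}^n,\, y_i^n} \prod_{j=i}^{n} p_j \prod_{j=i+1}^{n} r_j = 1.$$

Hence we obtain

$$\sum_{x_{i+1}^n,\, y_i^n} \left(\prod_{j=i}^{n} p_j \prod_{j=i+1}^{n} r_j\right) \log \frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r_j} - \log r_i - \log \nu'_{i,(x^{i-1},y^{i-1})} = 0,$$

where

$$\log \nu'_{i,(x^{i-1},y^{i-1})} = 1 + \log \prod_{j=1}^{i-1} r_j - \nu^*_{i,(x^{i-1},y^{i-1})}.$$

Finally, we are left with the expression

$$r_i = \frac{r'(x^i,y^{i-1})}{\nu'_{i,(x^{i-1},y^{i-1})}},$$

where

$$r'(x^i,y^{i-1}) = \prod_{x_{i+1}^n,\, y_i^n} \left[\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}\right]^{\prod_{j=i}^{n} p(y_j|x^j,y^{j-1}) \prod_{j=i+1}^{n} r(x_j|x^{j-1},y^{j-1})}, \tag{24}$$

and $\nu'_{i,(x^{i-1},y^{i-1})}$ is determined by the constraint $\sum_{x_i} r_i = 1$, which gives (22).

We can see that for every $i$, $r_i$ depends on $q(x^n|y^n)$ and $\{r_{i+1}, r_{i+2}, \dots, r_n\}$, and $r_n$ is a function of $q(x^n|y^n)$ alone. Therefore, we can place $r_n$ in the function we have for $r_{n-1}$, thus making $r_{n-1}$ depend on $q(x^n|y^n)$ alone as well. Now we do the same for $r_{n-2}$ and so on, until for all $i$, $r_i$ is dependent on $q(x^n|y^n)$ alone. We name this method backwards maximization. Finally, we obtain $r(x^n\|y^{n-1}) = \prod_{i=1}^{n} r_i$ that maximizes the directed information where $q(x^n|y^n)$ is fixed, i.e., $c_1(q)$, and the lemma is proven. ■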



Having Lemmas 2-5, we can now state and prove our main theorem.

Theorem 1. For a fixed channel $p(y^n\|x^n)$, there exists an alternating maximization procedure, such as $I_L$ in Alg. 1, to compute

$$C_n = \frac{1}{n} \max_{r(x^n\|y^{n-1})} I(X^n \to Y^n).$$

Proof: To prove Theorem 1 we first have to show the existence of a double maximization problem, i.e., an equivalent problem where we maximize over two variables instead of one, and this was shown in Lemma 3. Now, in order for the alternating maximization procedure to work on this optimization problem, we need to show that the conditions given in Lemma 1 hold here, and this was shown in Lemmas 2, 4 and 5. Thus, we have an algorithm for calculating

$$C_n = \frac{1}{n} \max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$$

that is equal to $\lim_{k \to \infty} I_L(k)$, where $I_L(k)$ is the value of $I_L$ in the $k$th iteration as in Alg. 1. Hence, the theorem is proven. ■

Our last step in proving the convergence of Alg. 1 is to show why $I_U$ is a tight upper bound. For that reason we state the following theorem.

Theorem 2. For the value of $C_n = \frac{1}{n} \max_{r(x^n\|y^{n-1})} I(X^n \to Y^n)$, the inequality

$$C_n \le I_U, \tag{25}$$

where

$$I_U = \frac{1}{n} \min_{r(x^n\|y^{n-1})} \max_{x_1} \sum_{y_1} \max_{x_2} \cdots \max_{x_n} \sum_{y_n} p(y^n\|x^n) \log \frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-1})},$$

holds. Furthermore, if $r(x^n\|y^{n-1})$ achieves $C_n$, then we have equality in (25).

The proof is given in Appendix B for the general case of delay $d$. We also omit the proof of the upper bound for the case where the feedback is a deterministic function of the delayed output, as described in Appendix A.

V. Numerical examples for calculating feedback channel's capacities 

In this section we present some examples of the performance of Alg. 1 over various channels. We start with a memoryless channel to see whether feedback improves the capacity of such channels, and continue with specific FSCs such as the trapdoor channel and the Ising channel. Since Alg. 1 is applicable to finite state channels (FSCs), we describe this class of channels and their properties. Gallager [26] defined the FSC as one in which the influence of the previous input and output sequence, up to a given point, may be summarized using a state with finite cardinality. The FSC is stationary and characterized by the conditional PMF $p(y_i, s_i|x_i, s_{i-1})$ that satisfies

$$p(y_i, s_i|x^i, s^{i-1}) = p(y_i, s_i|x_i, s_{i-1}),$$

and the initial state $p(s_0)$.

The causal conditioning probability of the output given the input is given by

$$p(y^n\|x^n, s_0) = \sum_{s^n} \prod_{i=1}^{n} p(y_i, s_i|x_i, s_{i-1}),$$

and

$$p(y^n\|x^n) = \sum_{s_0} p(y^n\|x^n, s_0)\, p(s_0).$$

So 

Note that a memoryless channel, i.e., the output at any given time is dependent on the input at that time alone, is 
an FSC with one state. 

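It was shown in [9] that the capacity of an FSC with feedback is bounded between

$$\underline{C}_n - \frac{\log |\mathcal{S}|}{n} \le C \le \overline{C}_n + \frac{\log |\mathcal{S}|}{n}, \tag{26}$$

where

$$\overline{C}_n = \frac{1}{n} \max_{p(x^n\|y^{n-1})} \max_{s_0} I(X^n \to Y^n|s_0), \tag{27}$$

$$\underline{C}_n = \frac{1}{n} \max_{p(x^n\|y^{n-1})} \min_{s_0} I(X^n \to Y^n|s_0). \tag{28}$$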
C=— max min/(X" y"|so)- (28) 
If we require that the probability of error tends to zero for every initial state $s_0$, then

$$C = \lim_{n \to \infty} \underline{C}_n.$$

Since these bounds are obtained via maximization of the directed information, we can calculate them using Alg. 1 as presented in Section III, thus estimating the capacity.

Our first example shows the convergence of Alg. 1 to the analytical capacity of a memoryless channel.

A. Binary Symmetric Channel 

Consider a memoryless BSC with crossover probability $p = 0.3$ as in Fig. 1. The capacity of this BSC is known to be $C = 1 - H(0.3) = 0.1187$. In Fig. 2 we present the directed information upper bound $I_U$ and lower bound $I_L$ as a function of the iteration (as given in Alg. 1) and compare them to the capacity that is known analytically. Shannon showed [27] that for memoryless channels, feedback does not increase the capacity. Thus, we can expect the numerical solution given in Alg. 1 to achieve the same value as in the no-feedback case. Indeed, we can see that as the number of iterations increases, the algorithm approaches the true value and converges. Furthermore, the causally conditioned probability $r(x^n\|y^{n-1})$ that Alg. 1 achieves is actually $r(x^n)$, i.e., it does not depend on the feedback. We note here that we can achieve the capacity of the channel using a uniform distribution of $r(x^n)$. This does not imply that there is only one optimum distribution, and indeed the one that Alg. 1 achieves is not uniform.

Fig. 1: Binary Symmetric Channel

Fig. 2: Performance of Alg. 1 over BSC(0.3). The lower and upper lines are the bounds in each iteration in Alg. 1, whereas the horizontal line is the analytical calculation of the capacity. (Horizontal axis: iteration; curves: upper bound $I_U$, lower bound $I_L$, true capacity.)



B. Trapdoor Channel 

1) Trapdoor channel with 2 states: The trapdoor channel was introduced by David Blackwell in 1961 [28] and later on by Ash [29]. One can look at this channel as follows: Consider a binary channel modulated by a box that contains a single bit referred to as the state. In every step, an input bit is fed to the channel, which then transmits either that bit or the one already contained in the box, each with probability $\frac{1}{2}$. The bit that was not transmitted remains in the box for future steps as the state of the channel. The state, thus, is the bit in the box, and since it can be '0' or '1', we conclude that $|\mathcal{S}| = 2$, or $\log |\mathcal{S}| = 1$.

Fig. 3: Trapdoor Channel

In order to use Alg. 1 we first have to calculate the channel probability $p(y^n\|x^n, s_0)$. For that purpose, we find $p(y_i|x^i, y^{i-1}, s_0)$ analytically. Note that $p(y_i|x^i, y^{i-1}, s_0) = p(y_i|x_i, s_{i-1})$. Thus, first we find the deterministic function for $s_{i-1}$ given the past input, output, and initial state, i.e., $(x^{i-1}, y^{i-1}, s_0)$, and then the function for $p(y_i|x^i, y^{i-1}, s_0) = p(y_i|x_i, s_{i-1})$. An examination of the truth table in Table II yields the formula for $s_{i-1}$ as

$$s_{i-1} = x_{i-1} \oplus y_{i-1} \oplus s_{i-2} = \bigoplus_{m=1}^{i-1} (x_m \oplus y_m) \oplus s_0.$$

Note that in Table II the input series (0,0,1) and (1,1,0) are not possible, since the output is not one of the bits in the box; thus we may assign to $s_{i-1}$ whatever value we choose, in order to simplify the formula. As for the conditional probability $p(y_i|x^i, y^{i-1}, s_0)$, we assume that $s_0 = 0$, and because of the channel's symmetry the outcome for $s_0 = 1$ is easily calculated. Looking at Table III we can see that the formula for $p(y_i|x^i, y^{i-1}, s_0 = 0)$ is given by

$$p(y_i|x^i, y^{i-1}, s_0 = 0) = \frac{1}{2}(x_i \oplus s_{i-1}) + \overline{(x_i \oplus s_{i-1})} \wedge \overline{(x_i \oplus y_i)},$$

where we know that $s_{i-1}$ is a function of $(x^{i-1}, y^{i-1}, s_0)$, and $\wedge$ denotes AND.

TABLE II: $s_{i-1}$ as a function of $x_{i-1}$, $s_{i-2}$ and $y_{i-1}$

x_{i-1} | s_{i-2} | y_{i-1} | s_{i-1}
0 | 0 | 0 | 0
0 | 0 | 1 | 1 (not possible)
0 | 1 | 0 | 1
0 | 1 | 1 | 0
1 | 0 | 0 | 1
1 | 0 | 1 | 0
1 | 1 | 0 | 0 (not possible)
1 | 1 | 1 | 1

TABLE III: $p(y_i|s_{i-1}, x_i)$

x_i | s_{i-1} | y_i | p(y_i | x^i, y^{i-1}, s_0 = 0)
0 | 0 | 0 | 1
0 | 0 | 1 | 0
0 | 1 | 0 | 0.5
0 | 1 | 1 | 0.5
1 | 0 | 0 | 0.5
1 | 0 | 1 | 0.5
1 | 1 | 0 | 0
1 | 1 | 1 | 1
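The state recursion and the conditional probability above translate directly into a small function. The sketch below is illustrative: the name `trapdoor_step` and its argument layout (`x_hist` holds $x^i$, `y_hist` holds $y^{i-1}$) are assumptions made for this example. It follows the XOR formula for $s_{i-1}$, so for histories that are not physically possible the state value is arbitrary, as noted in the text; such histories have zero probability anyway.

```python
import itertools

def trapdoor_step(y_i, x_hist, y_hist, s0=0):
    """p(y_i | x^i, y^{i-1}, s_0) for the 2-state trapdoor channel.

    x_hist = (x_1, ..., x_i) and y_hist = (y_1, ..., y_{i-1}) are tuples of bits.
    """
    s = s0
    for x, y in zip(x_hist[:-1], y_hist):
        s ^= x ^ y                      # s_m = x_m XOR y_m XOR s_{m-1}
    x_i = x_hist[-1]
    if x_i != s:
        return 0.5                      # input and state differ: either bit leaves
    return 1.0 if y_i == x_i else 0.0   # both bits equal: the output is forced

def trapdoor_causal(y, x, s0=0):
    """p(y^n || x^n, s_0) as the product of the per-step probabilities."""
    out = 1.0
    for i in range(len(x)):
        out *= trapdoor_step(y[i], x[:i + 1], y[:i], s0)
    return out

if __name__ == "__main__":
    for x in itertools.product((0, 1), repeat=2):
        print(x, [trapdoor_causal(y, x) for y in itertools.product((0, 1), repeat=2)])
```

For every input sequence the values of `trapdoor_causal` sum to one over the output sequences, which is a quick sanity check before passing the channel to a maximization routine.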

Now that we have $p(y^n\|x^n, s_0 = 0)$, we use Alg. 1 for estimating the capacity of the channel as we run the algorithm to find the upper and lower bound for every $n \in \{1,\dots,12\}$, where

$$\overline{C}_n = \max_{s_0} \max_{r(x^n\|y^{n-1})} \frac{1}{n} I(X^n \to Y^n|s_0) + \frac{1}{n}, \tag{29}$$

$$\underline{C}_n = \max_{r(x^n\|y^{n-1})} \min_{s_0} \frac{1}{n} I(X^n \to Y^n|s_0) - \frac{1}{n}. \tag{30}$$

Note that (29) is calculated via Alg. 1 with $s_0 = 0$, due to the channel's symmetry. However, calculating (30) is more difficult, since we have to maximize over all the probabilities $r(x^n\|y^{n-1})$ and at the same time minimize over the initial state. Hence, we use another lower bound denoted by $\underline{C}^*_n$, for which $r(x^n\|y^{n-1})$ is fixed and is the one that achieves the maximum in (29), and we only minimize over $s_0$. Clearly, $\underline{C}^*_n \le \underline{C}_n$. Fig. 4 presents the capacity estimation, and the upper and lower bounds, as a function of the block length $n$. In [21], the capacity of the trapdoor channel is calculated analytically, and given by

$$C = \lim_{n \to \infty} C_n = \log \frac{1+\sqrt{5}}{2} \approx 0.69424191. \tag{31}$$

We see from the simulation that the upper and lower bounds of the capacity approach the limit in (31), and the estimated capacity at block length $n = 12$ is $C_{12} = 0.6706533$.

Fig. 4: Plot of $\overline{C}_n$, $\underline{C}_n$, and the true capacity of the trapdoor channel with 2 states and feedback with delay 1. (Horizontal axis: block length $n$, from 1 to 12.)

2) Directed information rate as a different estimator for the capacity: We now consider an estimator of the feedback capacity of an FSC, obtained by calculating $(n+1)C_{n+1} - nC_n$. The justification for this estimator is based on the following lemma.

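Lemma 6. If $\lim_{n\to\infty} I(X^n; Y_n|Y^{n-1})$ exists, then

$$\lim_{n\to\infty} \frac{1}{n} I(X^n \to Y^n) = \lim_{n\to\infty} \left( I(X^n \to Y^n) - I(X^{n-1} \to Y^{n-1}) \right),$$

i.e.,

$$\lim_{n\to\infty} C_n = \lim_{n\to\infty} \left[ (n+1)C_{n+1} - nC_n \right].$$

Proof: If we suppose that the limit above exists, then

$$\lim_{n\to\infty} \left( I(X^n \to Y^n) - I(X^{n-1} \to Y^{n-1}) \right) = \lim_{n\to\infty} I(X^n; Y_n|Y^{n-1})$$

$$\stackrel{(a)}{=} \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1})$$

$$= \lim_{n\to\infty} \frac{1}{n} I(X^n \to Y^n),$$

where (a) follows from the fact that if the limit of the sequence $\{a_n\}$ exists, then the average of the sequence converges to the same limit. Further, a result from Q provides that if the joint process $\{X_i, Y_i\}$ is stationary, then the limit $\lim_{n\to\infty} I(X^n; Y_n|Y^{n-1})$ exists. ■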
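To illustrate why the difference estimator can converge faster than the normalized sum, the following sketch uses a synthetic sequence $a_i$ standing in for $I(X^i; Y_i|Y^{i-1})$. The sequence, its limit, and its decay rate are arbitrary assumptions chosen for illustration, not values from the paper.

```python
LIMIT = 0.7                       # hypothetical limit of a_i

def a(i):
    """Synthetic per-step term, standing in for I(X^i; Y_i | Y^{i-1})."""
    return LIMIT - 0.5 ** i

def c(n):
    """Normalized sum C_n = (1/n) * sum_{i=1}^n a_i (a Cesaro mean)."""
    return sum(a(i) for i in range(1, n + 1)) / n

def rate_estimator(n):
    """(n+1) C_{n+1} - n C_n, which telescopes to a_{n+1}."""
    return (n + 1) * c(n + 1) - n * c(n)

for n in (2, 6, 11):
    print(n, round(c(n), 6), round(rate_estimator(n), 6))
```

Because the difference recovers the latest term $a_{n+1}$ directly, it is not slowed down by the early terms that the average still carries.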

Fig. 5 presents the directed information rate estimator using the lemma above, and its comparison to the true capacity. One can see that $(n+1)C_{n+1} - nC_n$ converges faster than $C_n$ and the upper and lower bounds seen in Fig. 4, and achieves the value 0.6942285 when we calculate the 11th difference. Furthermore, the directed information rate estimator stabilizes faster.

Fig. 5: The upper line is $(n+1)C_{n+1} - nC_n$ calculated using Alg. 1 and the horizontal line is the analytical calculation, for the trapdoor channel with 2 states and feedback with delay 1.

3) M-State Trapdoor channel: We generalize the trapdoor channel to an $M$-state one.

Fig. 6: Trapdoor channel with $M$ states.

In the previous example we had $M = 2$ cells in the box, one for the state bit and one for the input bit. One can consider the state to be the number of 1's in the channel before a new input is inserted. We can extend this notion by letting the 'box' contain more than 2 cells, as presented in Fig. 6. Here, the state at any given time expresses the number of 1's that are in the box at that time, and each cell has equal probability of being chosen for the output. In this case, $M$
cells in the box are equivalent to $M$ states of the channel. By that definition, the state $s_{i-1}$ as a function of past inputs, outputs, and the initial state is given by

$$s_{i-1} = s_0 + \sum_{j=1}^{i-1} (x_j - y_j).$$

Moreover, to calculate the channel probability $p(y_i = 1|x^i, y^{i-1}, s_0)$, we add $s_{i-1}$ to $x_i$ and divide the sum by the number of cells, i.e.,

$$p(y_i = 1|x^i, y^{i-1}, s_0) = \frac{s_{i-1} + x_i}{M}.$$
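A minimal sketch of the $M$-cell channel described above. It assumes that the state counts the 1's among the $M-1$ balls already in the box, so that $s \in \{0, \dots, M-1\}$, and that the state evolves as $s_i = s_{i-1} + x_i - y_i$; the update rule and the function names are assumptions made for illustration.

```python
def m_trapdoor_step(y, x, s, m):
    """p(y_i | x_i, s_{i-1}) when one of the m cells is drawn uniformly."""
    p_one = (s + x) / m            # fraction of cells holding a 1 after inserting x
    return p_one if y == 1 else 1.0 - p_one

def m_trapdoor_next_state(x, s, y):
    """Assumed state update: a ball x enters the box and a ball y leaves it."""
    return s + x - y

def m_causal_pmf(xs, ys, s0, m):
    """p(y^n || x^n, s_0) as a product of per-step probabilities."""
    prob, s = 1.0, s0
    for x, y in zip(xs, ys):
        prob *= m_trapdoor_step(y, x, s, m)
        if prob == 0.0:
            return 0.0             # impossible output sequence; stop early
        s = m_trapdoor_next_state(x, s, y)
    return prob
```

With `m = 2` the per-step probabilities coincide with those of the binary trapdoor channel of the previous example.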



Now that we have $p(y^n\|x^n, s_0)$, we use Alg. 1 to calculate $C_n$ for every $n \in \{1, 2, \dots, 12\}$. Fig. 7 presents the directed information rate estimator $(n+1)C_{n+1} - nC_n$ for the trapdoor channel with $M = 3$ cells. Note that in Fig. 7 we achieve the value 0.5423984 in the 11th difference; thus we can assume that the capacity of a 3-state trapdoor channel is approximately 0.542.

Fig. 7: Plot of $(n+1)C_{n+1} - nC_n$ for the trapdoor channel with 3 cells and feedback with delay 1.

4) Influence of the number of cells on the capacity: To summarize the trapdoor channel example, we examine the way the number of cells affects the capacity. The estimator used is the directed information rate, with $n = 12$.

Fig. 8: Change of $12C_{12} - 11C_{11}$ over the number of cells in the trapdoor channel with feedback with delay 1.

In Fig. 8 we can see that the capacity decreases as the number of cells increases and approaches zero.

C. The Ising channel

The Ising model is a mathematical model of ferromagnetism in statistical mechanics. It was originally proposed 
by the physicist Wilhelm Lenz who gave it as a problem to his student Ernst Ising after whom it is named. The 
model consists of discrete variables called spins that can be in one of two states. The spins are arranged in a lattice 
or graph, and each spin interacts only with its nearest neighbors. 

The Ising channel is based on this physical model, and simulates intersymbol interference, where the state of the channel at time $i$ is the current input, and the output is determined by the input at time $i + 1$. The channel (without feedback) was introduced by Berger and Bonomi [30] and is depicted in Fig. 9. In their paper, they proved the existence of bounds for the no-feedback case. In addition, they showed that the zero-error capacity without feedback is 0.5.

Fig. 9: The Ising channel [30].

1) Ising channel with delay d = 2: We estimate the capacity of the Ising channel with feedback. Since the output at time $i$ is determined by the inputs at times $i$ and $i + 1$, we define the channel PMF as $p(y_0^{n-1}\|x^n, s_0)$. Therefore, the feedback at time $i$ must be the output at time $i - 2$, since we cannot have $y_{i-1}$ before $x_i$ is sent. Thus, the Ising channel with delay $d = 1$ is not a practical example, and we did not examine it. We ran our algorithm on the Ising channel with delayed feedback of $d = 2$; the results are presented in Fig. 10. In Fig. 10(a), we obtain $C_{12} = 0.5459$, and in (b) we achieve $12C_{12} - 11C_{11} = 0.5563$ in the 11th difference.

Fig. 10: Performance of Alg. 1 on the Ising channel with feedback delay of $d = 2$. In (a) we present $C_n$, $\overline{C}_n$, $\underline{C}_n^*$, and in (b) we have $(n+1)C_{n+1} - nC_n$.

2) The effects the delay has on the capacity: Here we investigate how the delay influences the capacity. We do so by computing the directed information rate estimator of the Ising channel with blocks of length 12, over the feedback delays $d \in \{2, 3, \dots, 12\}$. The formulas for estimating the capacity when the delay is larger than 1 are given in Section III, equations (14), (15). In Fig. 11 we can see that, as expected, the capacity decreases as the delay increases. This is due to the fact that we have less knowledge of the output to use.

Fig. 11: Change of $12C_{12} - 11C_{11}$ over the delay of the feedback on the Ising channel.

VI. CONCLUSIONS 

In this paper, we generalized the classical BAA for maximizing the directed information over causal conditioning, i.e., calculating

$$C_n = \frac{1}{n} \max_{r(x^n\|y^{n-d})} I(X^n \to Y^n).$$

Optimizing the directed information is necessary for estimating the capacity of an FSC with feedback. As we attempted to solve this problem, we found that difficulties arose regarding the causal conditioning probability we tried to optimize over. We overcame this barrier by using an additional backward loop to find all components of the causal conditioning probability separately.

Another application of optimizing the directed information is to estimate the rate distortion function for source coding with feedforward, as presented in [31], [32], [33]. In our future work [34], we address the source coding with feedforward problem, and derive bounds for stationary and ergodic sources. We also present and prove a BA-type algorithm for obtaining a numerical solution that computes these bounds.



Appendix A 

General case for channel coding-Feedback that is a function of the delayed output 

Here we extend Alg. 1, given in Section IV, to channels where the encoder has specific information about the delayed output. In this case, the input probability is given by $r(x^n\|z^{n-d})$, where $z_i = f(y_i)$ is the feedback, and $f$ is deterministic. In other words, we solve the optimization problem given by

$$\max_{r(x^n\|z^{n-d})} I(X^n \to Y^n).$$

The optimization problem is associated with Fig. 12.

Fig. 12: Channel with delayed feedback as a function of the output.

The proof for this case is similar to that of Theorem 1, except for the steps that follow from Lemmas 4 and 5. Lemma 4 proves the existence of an argument $q(x^n|y^n)$ that maximizes the directed information where $r(x^n\|y^{n-d})$ is fixed. The modification of this lemma is presented here, where we find the argument $q(x^n|y^n)$ that maximizes the directed information where $r(x^n\|z^{n-d})$ is fixed; the proof is omitted. Therefore, the maximizer over $q(x^n|y^n)$ where $r(x^n\|z^{n-d})$ is fixed is given by

$$q(x^n|y^n) = \frac{r(x^n\|z^{n-d})\, p(y^n\|x^n)}{\sum_{x'^n} r(x'^n\|z^{n-d})\, p(y^n\|x'^n)}.$$
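The update above is a normalization over input sequences for each output sequence. A minimal sketch, assuming the products $r(x^n\|z^{n-d})\,p(y^n\|x^n)$ have already been tabulated in a dictionary keyed by `(x_seq, y_seq)`; the data layout and the function name are illustrative.

```python
from collections import defaultdict

def q_step(joint):
    """Return q(x^n | y^n) = joint[x, y] / sum_x' joint[x', y].

    `joint` maps (x_seq, y_seq) to r(x^n || z^{n-d}) * p(y^n || x^n).
    """
    py = defaultdict(float)
    for (xs, ys), w in joint.items():
        py[ys] += w                               # output marginal p(y^n)
    return {(xs, ys): (w / py[ys] if py[ys] > 0 else 0.0)
            for (xs, ys), w in joint.items()}

# Toy usage with n = 1: uniform input over a binary symmetric channel (crossover 0.1).
joint = {((x,), (y,)): 0.5 * (0.9 if x == y else 0.1) for x in (0, 1) for y in (0, 1)}
q = q_step(joint)
```

Output sequences with zero probability are assigned $q = 0$ here; they carry no weight in the directed information, so the choice does not affect the result.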



Lemma 5 proves the existence of an argument $r(x^n\|y^{n-d})$ that maximizes the directed information where $q(x^n|y^n)$ is fixed. We replace this lemma by Lemma 7.

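Lemma 7. For fixed $q(x^n|y^n)$, there exists $c_1(q)$ that achieves $\max_{r(x^n\|z^{n-d})} I(X^n \to Y^n)$, and it is given by

$$r(x^n\|z^{n-d}) = \prod_{i=1}^{n} r(x_i|x^{i-1}, z^{i-d}),$$

where

$$r(x_i|x^{i-1}, z^{i-d}) = \frac{r'(x^i, z^{i-d})}{\sum_{x_i} r'(x^i, z^{i-d})}, \qquad (32)$$

and

$$r'(x^i, z^{i-d}) = \prod_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} \left[ \frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r(x_j|x^{j-1}, z^{j-d})} \right]^{\frac{p(y^n\|x^n) \prod_{j=i+1}^{n} r(x_j|x^{j-1}, z^{j-d})}{\sum_{A_{i,d,z}} \prod_{j=1}^{i-d} p(y_j|x^j, y^{j-1})}}. \qquad (33)$$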



Proof: We find the products of $r(x^n\|z^{n-d})$ that achieve the maximum of the directed information. For convenience, let us use the shorthand $r_i \triangleq r(x_i|x^{i-1}, z^{i-d})$ and $p_i \triangleq p(y_i|x^i, y^{i-1})$. As in Lemma 2, $I(X^n \to Y^n)$ is concave in $\{r(x^n\|z^{n-d}), q(x^n|y^n)\}$; we omit the proof. Furthermore, the constraints of the optimization problem are affine, and we can use the Lagrange multipliers method with the Karush-Kuhn-Tucker conditions. We define the Lagrangian as:

$$J = \sum_{x^n, y^n} p(y^n\|x^n) \prod_{j=1}^{n} r_j \log\frac{q(x^n|y^n)}{\prod_{j=1}^{n} r_j} + \sum_{i=1}^{n} \sum_{x^{i-1}, z^{i-d}} \nu_{i,(x^{i-1}, z^{i-d})} \left( \sum_{x_i} r_i - 1 \right).$$



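Now, for every $i \in \{1, \dots, n\}$ we find $r_i$ s.t.

$$\frac{\partial J}{\partial r_i} = \sum_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} p(y^n\|x^n) \prod_{j \ne i} r_j \left[ \log\frac{q(x^n|y^n)}{\prod_{j=1}^{n} r_j} - 1 \right] + \nu_{i,(x^{i-1}, z^{i-d})}$$

$$= \prod_{j=1}^{i-1} r_j \sum_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} \left[ p(y^n\|x^n) \prod_{j=i+1}^{n} r_j \left( \log\frac{q(x^n|y^n)}{\prod_{j=1}^{n} r_j} - 1 \right) \right] + \nu_{i,(x^{i-1}, z^{i-d})} = 0,$$

where the set $A_{i,d,z} = \{y^{i-d} : z^{i-d} = f(y^{i-d})\}$ stands for all output sequences s.t. the function in the delay maps them to the same sequence $z^{i-d}$, which is the feedback.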



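Note that since $\prod_{j=1}^{i-1} r_j$ does not depend on the summation variables $(x_{i+1}^n, y_{i-d+1}^n)$ or on the choice of $y^{i-d} \in A_{i,d,z}$, we can take the product out of the sum. Furthermore, since $\nu_i$ is a function of $(x^{i-1}, z^{i-d})$, we can divide the whole equation by the product above, and get a new $\nu^*_{i,(x^{i-1}, z^{i-d})}$. Moreover, we can see that three of the expressions in the sum, i.e., $\{\log \prod_{j=1}^{i-1} r_j, \log r_i, 1\}$, do not depend on $(x_{i+1}^n, y_{i-d+1}^n)$, thus leaving their coefficient in the equation to be

$$\sum_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} p(y^n\|x^n) \prod_{j=i+1}^{n} r_j = \sum_{A_{i,d,z}} \prod_{j=1}^{i-d} p_j.$$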

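Hence we obtain:

$$\log r_i = \frac{\sum_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} p(y^n\|x^n) \prod_{j=i+1}^{n} r_j \log\frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r_j}}{\sum_{A_{i,d,z}} \prod_{j=1}^{i-d} p_j} - \log \nu^{**},$$

where

$$\log \nu^{**} = 1 + \log \prod_{j=1}^{i-1} r_j - \frac{\nu^*_{i,(x^{i-1}, z^{i-d})}}{\sum_{A_{i,d,z}} \prod_{j=1}^{i-d} p_j}.$$

Therefore, we are left with the expression:

$$r(x_i|x^{i-1}, z^{i-d}) = \frac{r'(x^i, z^{i-d})}{\sum_{x_i} r'(x^i, z^{i-d})},$$

where

$$r'(x^i, z^{i-d}) = \prod_{x_{i+1}^n,\, y_{i-d+1}^n,\, A_{i,d,z}} \left[ \frac{q(x^n|y^n)}{\prod_{j=i+1}^{n} r_j} \right]^{\frac{p(y^n\|x^n) \prod_{j=i+1}^{n} r_j}{\sum_{A_{i,d,z}} \prod_{j=1}^{i-d} p_j}}. \qquad (34)$$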



As in Section IV, we can see that for all $i$, $r_i$ depends on $q(x^n|y^n)$ and $\{r_{i+1}, r_{i+2}, \dots, r_n\}$, and $r_n$ is a function of $q(x^n|y^n)$ alone. Thus, we use the backward maximization method. After calculating $r_i$ for all $i = 1, \dots, n$, we obtain $r(x^n\|z^{n-d}) = \prod_{i=1}^{n} r_i$ that maximizes the directed information where $q(x^n|y^n)$ is fixed, i.e., $c_1(q)$, and the lemma is proven. ■
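The backward maximization can be sketched for a very small instance. The code below is an illustrative toy, not the paper's Alg. 1: it assumes $z_i = y_i$, delay $d = 1$, block length $n = 2$, the 2-state trapdoor channel with $s_0 = 0$, and the state update $s_i = s_{i-1} \oplus x_i \oplus y_i$; all function names are hypothetical. Each call to `iterate` performs the $q$-update and then updates $r_2$ before $r_1$.

```python
import itertools
import math

BITS = (0, 1)
SEQS = list(itertools.product(BITS, repeat=2))     # all x^2 (or y^2) for n = 2

def p_step(y, x, s):
    """Trapdoor channel p(y_i | x_i, s_{i-1})."""
    if x == s:
        return 1.0 if y == x else 0.0
    return 0.5

def p_causal(xs, ys, s0=0):
    """p(y^2 || x^2, s_0); the state update s ^ x ^ y is an assumption."""
    prob, s = 1.0, s0
    for x, y in zip(xs, ys):
        prob *= p_step(y, x, s)
        s ^= x ^ y
    return prob

def joint(r1, r2):
    """r(x^2 || y^1) p(y^2 || x^2) for every (x^2, y^2)."""
    return {(xs, ys): r1[xs[0]] * r2[(xs[0], ys[0])][xs[1]] * p_causal(xs, ys)
            for xs in SEQS for ys in SEQS}

def directed_info(r1, r2):
    """I(X^2 -> Y^2) in bits for the input distribution (r1, r2)."""
    j = joint(r1, r2)
    py = {ys: sum(j[(xs, ys)] for xs in SEQS) for ys in SEQS}
    return sum(w * math.log2(p_causal(xs, ys) / py[ys])
               for (xs, ys), w in j.items() if w > 0)

def iterate(r1, r2):
    """One alternating step: q-update, then r_2, then r_1 (backward in time)."""
    j = joint(r1, r2)
    py = {ys: sum(j[(xs, ys)] for xs in SEQS) for ys in SEQS}
    q = {k: (w / py[k[1]] if w > 0 else 0.0) for k, w in j.items()}

    new_r2 = {}
    for x1, y1 in itertools.product(BITS, repeat=2):
        w_ctx = p_step(y1, x1, 0)                  # sum over y_2 of p(y^2 || x^2)
        if w_ctx == 0.0:
            new_r2[(x1, y1)] = [0.5, 0.5]          # unreachable context; any PMF works
            continue
        scores = []
        for x2 in BITS:
            expo = sum(p_causal((x1, x2), (y1, y2)) * math.log(q[((x1, x2), (y1, y2))])
                       for y2 in BITS if p_causal((x1, x2), (y1, y2)) > 0)
            scores.append(math.exp(expo / w_ctx))
        new_r2[(x1, y1)] = [s / sum(scores) for s in scores]

    scores = []
    for x1 in BITS:
        expo = 0.0
        for x2, y1, y2 in itertools.product(BITS, repeat=3):
            p = p_causal((x1, x2), (y1, y2))
            r = new_r2[(x1, y1)][x2]
            if p > 0:
                expo += p * r * math.log(q[((x1, x2), (y1, y2))] / r)
        scores.append(math.exp(expo))
    new_r1 = [s / sum(scores) for s in scores]
    return new_r1, new_r2
```

Starting from uniform `r1` and `r2` and calling `iterate` repeatedly should give a non-decreasing sequence of `directed_info` values, which plays the role of the lower bound $I_L$ in this toy case.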

As mentioned, by replacing Lemmas 4 and 5 by those given here, we can follow the outline of Theorem 1 and conclude the existence of an alternating maximization procedure, i.e., we can compute

$$C_n = \frac{1}{n} \max_{r(x^n\|z^{n-d})} I(X^n \to Y^n),$$

which is equal to $\lim_{k\to\infty} I_L(k)$, where $I_L(k)$ is the value of $I_L$ in the $k$th iteration of the extended algorithm. One more step is required in order to prove the extension of Alg. 1 to the case presented here: the existence of $I_U$. This part is presented in Appendix B.



Appendix B 
Proof of Theorem 2

Here, we prove the existence of an upper bound, $I_U$, that converges to $C_n$ from above simultaneously with the convergence of $I_L$ to it from below, as in Alg. 1. To this purpose, we present and prove a few lemmas that assist in obtaining our main goal. We start with Lemma 8, which gives an inequality for the directed information. This inequality is used in Lemma 9 to prove the existence of our upper bound, which Lemma 10 proves to be tight. Theorem 2 combines Lemmas 9 and 10.

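Lemma 8. Let $I_{r_1}(X^n \to Y^n)$ correspond to $r_1(x^n\|y^{n-d})$; then for every $r_0(x^n\|y^{n-d})$,

$$I_{r_1}(X^n \to Y^n) \le \sum_{x^n, y^n} r_1(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}.$$

Proof: For any $r_1(x^n\|y^{n-d})$, $r_0(x^n\|y^{n-d})$,

$$\sum_{x^n, y^n} r_1(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})} - I_{r_1}(X^n \to Y^n)$$

$$= \sum_{x^n, y^n} r_1(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})} - \sum_{x^n, y^n} r_1(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_1(x'^n\|y^{n-d})}$$

$$= \sum_{x^n, y^n} r_1(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{\sum_{x'^n} p(y^n\|x'^n)\, r_1(x'^n\|y^{n-d})}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}$$

$$\stackrel{(a)}{=} D\left( p_1(y^n) \,\|\, p_0(y^n) \right)$$

$$\stackrel{(b)}{\ge} 0,$$

where in (a), $p_0(y^n)$ and $p_1(y^n)$ are the PMFs of $y^n$ that correspond to $r_0(x^n\|y^{n-d})$ and $r_1(x^n\|y^{n-d})$, and (b) follows from the non-negativity of the divergence. Thus, the lemma is proven. ■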
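For $n = 1$ (no feedback), the gap in Lemma 8 reduces to a divergence between two output distributions, which is simple to verify numerically. The channel matrix and the two input PMFs below are arbitrary illustrative choices.

```python
import math

P = [[0.9, 0.1],      # P[x][y] = p(y | x); an arbitrary binary channel
     [0.3, 0.7]]

def output_pmf(r):
    """Output distribution induced by the input PMF r."""
    return [sum(r[x] * P[x][y] for x in range(2)) for y in range(2)]

def bound(r1, r0):
    """sum_{x,y} r1(x) p(y|x) log2( p(y|x) / sum_x' p(y|x') r0(x') )."""
    p0 = output_pmf(r0)
    return sum(r1[x] * P[x][y] * math.log2(P[x][y] / p0[y])
               for x in range(2) for y in range(2))

def divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

r1, r0 = [0.2, 0.8], [0.6, 0.4]
mutual_info = bound(r1, r1)                 # I_{r1}: the bound evaluated at r0 = r1
gap = bound(r1, r0) - mutual_info
```

The value of `gap` should match `divergence(output_pmf(r1), output_pmf(r0))` up to rounding, mirroring step (a) of the proof.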

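Our next lemma uses the inequality in Lemma 8 to show the existence of the upper bound, which is the first step in proving Theorem 2.

Lemma 9. For every $r_0(x^n\|y^{n-d})$,

$$I_U \ge C_n,$$

where

$$I_U = \frac{1}{n} \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \sum_{y_2} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}.$$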

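Proof: To prove this lemma, we first use Lemma 8. For every $r_1(x^n\|y^{n-d})$, $r_0(x^n\|y^{n-d})$,

$$I_{r_1}(X^n \to Y^n) \stackrel{(a)}{\le} \sum_{x^n, y^n} \prod_{i=1}^{n} r_1(x_i|x^{i-1}, y^{i-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}$$

$$\stackrel{(b)}{\le} \sum_{x^{n-1}, y^{n-d}} \prod_{i=1}^{n-1} r_1(x_i|x^{i-1}, y^{i-d}) \sum_{x_n} r_1(x_n|x^{n-1}, y^{n-d}) \underbrace{\max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}}$$

$$\stackrel{(c)}{=} \sum_{x^{n-1}, y^{n-d}} \prod_{i=1}^{n-1} r_1(x_i|x^{i-1}, y^{i-d}) \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}$$

$$\le \sum_{x^{n-2}, y^{n-d-1}} \prod_{i=1}^{n-2} r_1(x_i|x^{i-1}, y^{i-d}) \max_{x_{n-1}} \sum_{y_{n-d}} \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}$$

$$\le \cdots \le \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \sum_{y_2} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})},$$

where (a) follows from Lemma 8, (b) follows from maximizing an expression over $x_n$, and (c) follows from the fact that the expression in the under-brace is a function of $(x^{n-1}, y^{n-d})$, so we can take it out of the summation over $x_n$ and use $\sum_{x_n} r_1(x_n|x^{n-1}, y^{n-d}) = 1$. The rest of the steps are the same as (b) and (c), where we refer to a different $x_i$.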

Since the inequality above is true for every $r_1(x^n\|y^{n-d})$, we can use it on $r_c(x^n\|y^{n-d})$ that achieves $C_n$, and thus for every $r_0(x^n\|y^{n-d})$,

$$C_n \le \frac{1}{n} \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \sum_{y_2} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})}.$$

This is also true for every $r_0(x^n\|y^{n-d})$, and hence for the minimum over all $r_0(x^n\|y^{n-d})$, and we obtain

$$C_n \le \frac{1}{n} \min_{r_0} \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \sum_{y_2} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r_0(x'^n\|y^{n-d})},$$

and the lemma is proven. ■
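The nested maximization and summation in the bound can be evaluated by recursion over time. The sketch below does so for the 2-state trapdoor channel with $n = 2$, $d = 1$, $s_0 = 0$, and a uniform $r_0$; these parameters, the assumed state update, and the function names are illustrative choices rather than the paper's implementation.

```python
import itertools
import math

N = 2
BITS = (0, 1)
SEQS = list(itertools.product(BITS, repeat=N))

def p_causal(xs, ys, s0=0):
    """p(y^n || x^n, s_0) for the trapdoor channel (assumed state update: s ^ x ^ y)."""
    prob, s = 1.0, s0
    for x, y in zip(xs, ys):
        prob *= (1.0 if y == x else 0.0) if x == s else 0.5
        s ^= x ^ y
    return prob

R0 = 1.0 / len(SEQS)                       # uniform r_0(x^n || y^{n-1})
PY = {ys: sum(R0 * p_causal(xs, ys) for xs in SEQS) for ys in SEQS}

def term(xs, ys):
    """p(y^n || x^n) log2( p(y^n || x^n) / sum_x' p(y^n || x'^n) r_0 )."""
    p = p_causal(xs, ys)
    return p * math.log2(p / PY[ys]) if p > 0 else 0.0

def nested(xs=(), ys=()):
    """max over x_i of the sum over y_i, applied for i = 1..N (delay d = 1)."""
    if len(xs) == N:
        return term(xs, ys)
    return max(sum(nested(xs + (x,), ys + (y,)) for y in BITS) for x in BITS)

upper_bound = nested() / N                                   # I_U for this r_0
rate_uniform = sum(R0 * term(xs, ys) for xs in SEQS for ys in SEQS) / N
```

Here `rate_uniform` is the normalized directed information achieved by the uniform input, so it cannot exceed `upper_bound`.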

The next part of Theorem 2 is to show that the bound is tight.

Lemma 10. The upper bound in Lemma 9 is tight, and is obtained by $r(x^n\|y^{n-d})$ that achieves the capacity.

Proof: In Lemma 9, we showed only half of the proof of the theorem, i.e., the existence of an upper bound. To prove this lemma, we need to show that this inequality is tight. For that purpose, we use the Lagrange multipliers method with the KKT conditions with respect to all $r(x_i|x^{i-1}, y^{i-d})$s. We can use the KKT conditions since the directed information is a concave function in all $r(x_i|x^{i-1}, y^{i-d})$s, as seen in Lemma 3.
We define the Lagrangian as

$$J = \sum_{x^n, y^n} r(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})} + \sum_{i=1}^{n} \sum_{x^{i-1}, y^{i-d}} \nu_{i,(x^{i-1}, y^{i-d})} \left( \sum_{x_i} r(x_i|x^{i-1}, y^{i-d}) - 1 \right) + \sum_{i=1}^{n} \sum_{x^i, y^{i-d}} h_{i,(x^i, y^{i-d})}\, r(x_i|x^{i-1}, y^{i-d}).$$



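Now, for every $r(x_i|x^{i-1}, y^{i-d})$ we have

$$\frac{\partial J}{\partial r(x_i|x^{i-1}, y^{i-d})} = \prod_{j=1}^{i-1} r(x_j|x^{j-1}, y^{j-d}) \sum_{x_{i+1}, y_{i-d+1}} r(x_{i+1}|x^i, y^{i-d+1}) \cdots \sum_{x_n, y_{n-d}} r(x_n|x^{n-1}, y^{n-d}) \sum_{y_{n-d+1}^n} p(y^n\|x^n) \left[ \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})} - 1 \right] + \nu_{i,(x^{i-1}, y^{i-d})} + h_{i,(x^i, y^{i-d})}.$$

Setting $\frac{\partial J}{\partial r(x_i|x^{i-1}, y^{i-d})} = 0$, we are left with two cases. For $r(x_i|x^{i-1}, y^{i-d}) > 0$, the KKT conditions require us to set $h_i = 0$, and we obtain

$$\sum_{x_{i+1}, y_{i-d+1}} r(x_{i+1}|x^i, y^{i-d+1}) \cdots \sum_{x_n, y_{n-d}} r(x_n|x^{n-1}, y^{n-d}) \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})} = \nu^*_{i,(x^{i-1}, y^{i-d})},$$

where $\nu^*_{i,(x^{i-1}, y^{i-d})}$ is a function of $(x^{i-1}, y^{i-d})$ alone, whereas for $r(x_i|x^{i-1}, y^{i-d}) = 0$ we set $h_i > 0$ and the equality becomes an inequality.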

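We now analyze our results for the case where $r(x_i|x^{i-1}, y^{i-d}) > 0$. First, we note that for $i = n$ we have that

$$\sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})} = \nu^*_{n,(x^{n-1}, y^{n-d})},$$

and thus constant for every $x_n$. As a result, for $i = n - 1$ we have

$$\sum_{x_n, y_{n-d}} r(x_n|x^{n-1}, y^{n-d}) \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$= \sum_{y_{n-d}} \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})},$$

that again, is constant for every $x_{n-1}$. We can move backwards and obtain that for $i = 1$,

$$\sum_{x_2} r(x_2|x_1) \cdots \sum_{x_d} r(x_d|x^{d-1}) \sum_{x_{d+1}, y_1} r(x_{d+1}|x^d, y_1) \cdots \sum_{x_n, y_{n-d}} r(x_n|x^{n-1}, y^{n-d}) \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$= \sum_{x_2} r(x_2|x_1) \cdots \sum_{x_d} r(x_d|x^{d-1}) \sum_{y_1} \max_{x_{d+1}} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$= \max_{x_2^d} \sum_{y_1} \max_{x_{d+1}} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}.$$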

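Using the analysis above, we find an expression for $C_n$ where $r(x^n\|y^{n-d})$ achieves it. In the following equations we can assume that $r(x^n\|y^{n-d}) > 0$, since otherwise, for the specific $x^n$, $y^n$, the expression for $C_n$ contributes 0 to the summation.

$$C_n = \frac{1}{n} \sum_{x^n, y^n} r(x^n\|y^{n-d})\, p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$= \frac{1}{n} \sum_{x_1} r(x_1) \sum_{x_2} r(x_2|x_1) \cdots \sum_{x_d} r(x_d|x^{d-1}) \sum_{x_{d+1}, y_1} r(x_{d+1}|x^d, y_1) \cdots \sum_{x_n, y_{n-d}} r(x_n|x^{n-1}, y^{n-d}) \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$= \frac{1}{n} \sum_{x_1} r(x_1) \max_{x_2^d} \sum_{y_1} \max_{x_{d+1}} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})}$$

$$\stackrel{(a)}{=} \frac{1}{n} \max_{x^d} \sum_{y_1} \max_{x_{d+1}} \cdots \max_{x_n} \sum_{y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|y^{n-d})},$$

where (a) is due to the analysis above for $i = 1$. We showed that the upper bound is tight, and thus the lemma is proven. ■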

Now we combine both lemmas to conclude our main theorem.

Proof of Theorem 2: As shown in Lemma 9, there exists an upper bound for $C_n$. Lemma 10 showed that this upper bound is tight when using the PMF $r(x^n\|y^{n-d})$ that achieves $C_n$. Thus, the theorem is proven. ■
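For $n = 1$ and no feedback, the lower and upper bound sequences reduce to those of the classical BAA. The following sketch runs that special case on a Z-channel (an illustrative choice, not an example from the paper) and lets one check that the two bounds sandwich the known capacity $\log_2(5/4)$ of a Z-channel with crossover probability 1/2.

```python
import math

def blahut_arimoto(p, iters=200):
    """Classical BAA for a memoryless channel p[x][y] = p(y|x).

    Returns (lower, upper, r) with I(r) <= C <= max_x D(p(.|x) || p_Y) in bits,
    where the bounds are evaluated at the input PMF before its final update.
    """
    nx, ny = len(p), len(p[0])
    r = [1.0 / nx] * nx
    lower = upper = 0.0
    for _ in range(iters):
        py = [sum(r[x] * p[x][y] for x in range(nx)) for y in range(ny)]
        d = [sum(p[x][y] * math.log2(p[x][y] / py[y])
                 for y in range(ny) if p[x][y] > 0) for x in range(nx)]
        lower = sum(r[x] * d[x] for x in range(nx))   # mutual information at r
        upper = max(d)                                # n = 1 form of the upper bound
        w = [r[x] * 2.0 ** d[x] for x in range(nx)]
        r = [v / sum(w) for v in w]                   # combined q-step and r-step
    return lower, upper, r

z_channel = [[1.0, 0.0],
             [0.5, 0.5]]
lower, upper, r = blahut_arimoto(z_channel)
```

The gap `upper - lower` can serve as a stopping criterion, which is the role the two bound sequences play in the extended algorithm.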

Generalization of Theorem 2: We generalize Theorem 2 to the case where the feedback is a delayed function of the output (as presented in Appendix A). We recall that the optimization problem for this model is

$$\max_{r(x^n\|z^{n-d})} I(X^n \to Y^n).$$

While solving this optimization problem, we defined the following set: $A_{i,d,z} = \{y^{i-d} : z^{i-d} = f(y^{i-d})\}$, namely, all output sequences $y^{i-d}$ s.t. the function in the delay sends them to the same sequence $z^{i-d}$. We use this notation for the upper bound. In that case, the upper bound is of the form

$$I_U = \frac{1}{n} \max_{x^d} \sum_{z_1} \max_{x_{d+1}} \cdots \sum_{z_{n-d}} \max_{x_n} \sum_{A_{n,d,z},\, y_{n-d+1}^n} p(y^n\|x^n) \log\frac{p(y^n\|x^n)}{\sum_{x'^n} p(y^n\|x'^n)\, r(x'^n\|z^{n-d})}.$$

The proof for this upper bound is omitted due to its similarity to the case where $z_i = y_i$ for all $i$, i.e., Theorem 2. Moreover, one can see that this is a generalization, since if indeed $z_i = y_i$, then $A_{n,d,z}$ has only one sequence, $y^{n-d}$, and the equation for $I_U$ coincides with the one in Theorem 2.
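The sets $A_{i,d,z}$ are the preimages of the feedback map applied to output prefixes. A small sketch that builds them by grouping, with a hypothetical ternary output alphabet and a hypothetical map `f` chosen only for illustration:

```python
import itertools
from collections import defaultdict

def preimage_sets(alphabet, length, f):
    """Group output sequences of the given length by their feedback image z = f(y), applied per symbol."""
    groups = defaultdict(list)
    for ys in itertools.product(alphabet, repeat=length):
        zs = tuple(f(y) for y in ys)
        groups[zs].append(ys)
    return dict(groups)

# Hypothetical feedback: the encoder only learns whether each output was zero.
sets = preimage_sets(alphabet=(0, 1, 2), length=2, f=lambda y: int(y != 0))
identity_sets = preimage_sets(alphabet=(0, 1), length=3, f=lambda y: y)
```

With the identity map every set contains a single sequence, which matches the remark that the bound then coincides with the one in Theorem 2.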